Mining Web Text for Brand Associations
Abstract
Weblogs and other web text are an incredibly rich knowledge base, and the marketing industry is beginning to recognize the value of web texts as a source of information about its customers. However, the nature of web texts makes them unsuitable for analysis using standard market research methods. In this paper, we describe the use of exploratory data analysis techniques to extract the associations patients and their caregivers have for eight leading medications for a seizure disorder. We first collect a list of candidate keywords which occur with a brand name or a variant spelling, and then cluster the keywords to construct a set of brand associations. To compare individual brands, we measure the association between each brand name and any term from each of the clusters. Finally, using dimensionality reduction techniques, we plot the brand names, their associations, and their relationships as a brand association map.
Web text, such as blogs, newsgroups, message boards, and email lists, can provide an easily collected and incredibly rich source of data on a nearly limitless range of topics. Issues related to health and medicine are particularly well represented: as of October 25, 2005, Yahoo! Groups (groups.yahoo.com) listed 85,349 separate discussion groups under the “Health and Wellness” category, and Technorati (www.technorati.com) listed 79,367 blog posts with the tags “Health”, “Health and wellness,” or “Medical”. People faced with medical problems turn to the Internet for support from other patients and their families, for information about their disease and the possible treatments, for help navigating the medical establishment, and sometimes for a sympathetic venue to vent their frustrations.
Not surprisingly, the marketing industry is beginning to recognize the value of web texts as a source of information about its customers. However, the nature of web texts makes them unsuitable for analysis using standard market research techniques.
The sheer quantity of data makes comprehensive qualitative analysis impossible. While looking at a small sample of text may yield some insights, it may also lead an analyst to grant too much weight to accidental properties of a small sample, and at the same time to miss subtle patterns which can only be detected by looking at the data set as a whole. Web texts contain a huge amount of information in the aggregate, but individual posts by themselves are rarely informative.
Computational linguists and other researchers in the field of text mining have developed a toolkit of statistical methods for analyzing large quantities of text. In this paper, we describe the use of a combination of rule-based and statistical exploratory data analysis techniques to extract brand associations: the feelings, beliefs, knowledge, and attitudes people have towards a product. The focus of the study was to find and document the associations that patients and their caregivers have with eight medications for a seizure disorder.
The corpus we are working with is a collection of posts to a number of Internet discussion groups and other websites used by epilepsy patients and their families. The corpus contains a total of 26,062,526 words in 316,373 posts from 19 different sites and 8,731 distinct users. Posts average 119 words each.
“Sentiment analysis” is a well-known technique for measuring whether the associations with a brand name are generally positive or negative (Turney 2002). This is unlikely to be helpful for looking at brand associations with medication names, however. One reason for this is that the associations with medication names, especially those used to treat a serious condition like epilepsy, are overwhelmingly negative. Even if someone would recommend a treatment to another patient, on balance almost all patients would much prefer not to be taking any medications at all.
Another problem with applying sentiment analysis in this domain is that patients turn to on-line communities when they have a problem. Users rarely post to say that their disease is still under control or that they are suffering no noticeable side effects. So, the overall direction of the sentiments expressed in web texts will be strongly biased toward the negative, simply by the nature of the medium. And, finally, a simple negative or positive judgment is too coarse-grained. Market researchers need to know more detail about brand associations, and in particular are interested in what differentiates competing brands in the minds of consumers.
What is needed to make use of web texts for market research is a set of quantitative techniques for extracting brand association patterns from large quantities of unstructured web text. The methods need to be sufficiently automatic that they can be applied to very large quantities of data without human intervention. They also must be flexible enough to allow room for manual intervention by a domain expert when appropriate. Finally, the results need to be visualized in a way which makes them comprehensible to someone who is not a text mining expert, and which makes clear what actions should be taken based on the evidence in the web texts.
The first step in extracting brand associations is to identify posts which mention one of the target brand names. As a first pass, we could simply perform a keyword search, retrieving all posts which contain one of the brand names. FDA regulations require that medication names not be easily confused with normal words or with each other, which means that this strategy will result in few false positive matches. However, medications are often marketed under a variety of different names.
For example, the anti-convulsant Tegretol is known by the generic name “carbamazepine” and has also been sold under the brand names Amizepine, Carbazepin, Epitol, Finlepsin, and Neurotol. To expand the search to include these alternate names, we extracted the alternate entries for each medication from MeSH, a controlled vocabulary for subject headings developed by the US National Institutes of Health’s National Library of Medicine (http://www.nlm.nih.gov/mesh).
A more serious problem with a naive keyword matching strategy for finding brand name mentions is that users’ spellings of these words vary widely. Web texts in general are informal and unedited, and are full of idiosyncratic spelling, punctuation, and formatting. In addition, medication names are by nature unfamiliar, difficult-to-spell words, making the likelihood that they will be spelled correctly in web texts even lower. And, finally, posters often use conventional nicknames for commonly mentioned drugs rather than their full official names.
To test the effectiveness of various keyword spotting strategies, we constructed a high-recall list of candidate brand name mentions. Using a finite-state transducer, we extracted all terms from the corpus which were within a Levenshtein distance of three or less from one of the MeSH names for carbamazepine (e.g., tegretol, tegreatol), or which were within a Levenshtein distance of one or less of being a prefix or a suffix (e.g., teg, tege). This produced a list of 644,129 candidate mentions, with 516 unique types. We then filtered this list by hand to find 6,150 genuine brand mentions (with 97 types). Using this hand-corrected list as a gold standard, we can estimate the precision and recall of a variety of keyword spotting techniques. The naive algorithm, simply searching for mentions of “Tegretol,” yields 4,492 hits, for a recall of 73.0%. Adding the alternative names extracted from MeSH increases this to 4,728 hits, for a recall of 76.9%.
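The matching and evaluation described above can be illustrated with a minimal sketch. It is not the finite-state transducer used in the study: it is a plain dynamic-programming edit distance, and the function names `levenshtein`, `is_mention`, and `recall`, the `max_distance` default, and the `min_prefix` cutoff are all assumptions introduced here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def is_mention(token, names, max_distance=1, min_prefix=3):
    """True if token is close to, or a prefix of, any name.

    max_distance and min_prefix are illustrative parameters;
    min_prefix is a hypothetical guard against very short prefixes.
    """
    token = token.lower()
    for name in names:
        name = name.lower()
        if levenshtein(token, name) <= max_distance:
            return True
        if len(token) >= min_prefix and name.startswith(token):
            return True
    return False


def recall(true_hits, gold_total):
    """Fraction of gold-standard mentions that a strategy retrieves."""
    return true_hits / gold_total


names = ["Tegretol", "carbamazepine", "Epitol"]
print(is_mention("tegreatol", names))         # misspelling, one edit away
print(is_mention("teg", names))               # nickname that is a prefix
print(round(100 * recall(4492, 6150), 1))     # naive search figures from the text
```

The recall helper simply divides the number of retrieved genuine mentions by the 6,150 hand-verified mentions, which is how the 73.0% and 76.9% figures follow from 4,492 and 4,728 hits.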
Adding words with a Levenshtein distance of one or less from one of the names raises recall to 86.0%, with a negligible decrease in precision. Adding words which are a prefix of “Tegretol” further increases recall to 91.7%, still with near perfect precision. Moving to a Levenshtein distance of two further increases the recall, to 96.7%, but the precision drops to a worrying 90.8%. Therefore, in the following analysis, we treat a post as containing a mention of a brand name if it contains a word which is within a Levenshtein distance of one from, or which is a prefix of, one of the names listed in MeSH.
The next step is to collect a set of candidate keywords. These are words which (potentially) reflect the issues surrounding the brand names which users find salient. We proceed by marking all posts which contain a mention of at least one brand name, and compute the pointwise mutual information (Church & Hanks 1989) between brand name mentions and each term. After calculating this PMI score for each of the 20,505 vocabulary items which occurred at least fifteen times and which are not themselves brand names, we selected the top 5% as potential keywords. This yields 1,001 terms which co-occur with brand name mentions much more frequently than would be expected strictly due to chance.
Given a list of candidate keywords, we next construct a set of keyword clusters which reflect the issues potentially associated with brand names in the original corpus. As a first step, we construct a term co-occurrence matrix listing the number of times each word in the corpus occurs within a 15-word window of a ‘content bearing word’. For our purposes, a content bearing word is a non-function word which occurs in the corpus with moderately high frequency.
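The keyword selection step can be sketched as follows. This is a minimal illustration, assuming that probabilities are estimated at the post level (the fraction of posts containing a term, a brand mention, or both); the text does not state how the probabilities were estimated. The names `pmi_scores` and `top_fraction` and the `is_brand` callback are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(posts, is_brand, min_count=15):
    """PMI between 'post mentions a brand' and 'post contains term'.

    posts: list of token lists. is_brand: function token -> bool.
    Assumption: probabilities are post-level relative frequencies.
    Terms that never co-occur with a brand mention are omitted.
    """
    n_posts = len(posts)
    n_brand_posts = 0
    term_freq = Counter()         # total occurrences, for the frequency cutoff
    term_posts = Counter()        # number of posts containing the term
    term_brand_posts = Counter()  # ... that also mention a brand
    for tokens in posts:
        has_brand = any(is_brand(t) for t in tokens)
        n_brand_posts += has_brand
        term_freq.update(tokens)
        for term in set(tokens):
            if is_brand(term):
                continue  # brand names are not candidate keywords
            term_posts[term] += 1
            if has_brand:
                term_brand_posts[term] += 1

    p_brand = n_brand_posts / n_posts
    scores = {}
    for term, joint in term_brand_posts.items():
        if term_freq[term] < min_count:
            continue
        p_joint = joint / n_posts
        p_term = term_posts[term] / n_posts
        scores[term] = math.log2(p_joint / (p_term * p_brand))
    return scores


def top_fraction(scores, fraction=0.05):
    """Return the highest-scoring fraction of terms (at least one)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]


# Toy corpus; min_count is lowered because the example is tiny.
posts = [["tegretol", "memory"], ["tegretol", "memory"],
         ["weather"], ["weather", "memory"]]
scores = pmi_scores(posts, lambda t: t == "tegretol", min_count=1)
print(scores)
print(top_fraction(scores))
```

A positive score means the term appears in brand-mentioning posts more often than independence would predict, which is the property used to pick candidate keywords.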
Next, we reduce the dimensionality of the term co-occurrence matrix using Latent Semantic Analysis, a statistical technique similar to Principal Components Analysis, to reduce the influence of random noise in the data and to extract distributional patterns among words which reflect their semantic relationships (Schütze 1997). In this reduced dimensionality WORDSPACE model, each term is represented as a vector of 100 latent variables which reflect the distribution of terms in the original co-occurrence matrix.
We next perform a complete-linkage hierarchical clustering of the candidate keywords based on their WORDSPACE representations, at each stage joining a keyword or cluster of keywords with its closest neighbor in the semantic space. We define the semantic distance between two words as the cosine of the angle between their WORDSPACE vectors. The distance between two clusters is the longest pairwise distance between any two members of the clusters. For our list of 1,001 candidate keywords, hierarchical clustering yields a set of 161 keyword clusters containing 999 keywords (keywords which do not fit into any cluster are dropped).
Manual inspection of these clusters reveals that many of them represent plausible brand associations. For example, we find a MEMORY cluster that reflects the cognitive side effects of many anti-convulsant medications:
loss memory problem cognitive term short concentration speech trouble confusion recall concentrate coor-
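The complete-linkage clustering step described above can be sketched in plain Python. Several things here are assumptions rather than details from the text: the LSA reduction is not reproduced (the toy two-dimensional vectors stand in for 100-dimensional WORDSPACE vectors), distance is taken as one minus the cosine so that smaller values mean closer words, merging stops at a hypothetical `threshold`, and singleton clusters are treated as the keywords that "do not fit" and are dropped.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def complete_linkage(vectors, threshold):
    """Agglomerative clustering with complete linkage.

    vectors: dict mapping word -> vector (assumed nonzero).
    threshold: hypothetical cutoff; merging stops once the closest
    pair of clusters is farther apart than this.
    """
    clusters = [[word] for word in vectors]

    def cluster_distance(c1, c2):
        # Complete linkage: the longest pairwise distance between members.
        return max(cosine_distance(vectors[a], vectors[b])
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Assumption: a keyword left alone in its cluster is dropped.
    return [c for c in clusters if len(c) > 1]


# Toy vectors, invented for illustration only.
toy = {
    "memory": [1.0, 0.1],
    "recall": [0.9, 0.2],
    "rash": [0.0, 1.0],
    "itch": [0.1, 0.9],
    "odd": [-1.0, -1.0],
}
for cluster in complete_linkage(toy, threshold=0.1):
    print(cluster)
```

This quadratic search over cluster pairs is only suitable for small inputs; for a keyword list of around a thousand terms, a library routine for hierarchical clustering would be the more practical choice.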
Publication date: 2006